feat: Add read support for Parquet bloom filters #2653

ForeverAngry · 2025-10-22T00:49:16Z

Closes #2649

Rationale for this change

Add support for using bloom filters in the read path of pyiceberg.

Are these changes tested?

Yes.

Are there any user-facing changes?

I don't think so.

Fokko

Thanks @ForeverAngry for working on this. Since this changes the specification, we have to go through an Iceberg improvement proposal to ensure that there is consensus across different implementations.
As part of the change process, my main question would be: what's the added value on top of the bloom filters that are embedded in the Parquet files?

Fokko · 2025-10-22T06:38:48Z

pyiceberg/manifest.py

+        NestedField(
+            field_id=146,
+            name="bloom_filter_bytes",
+            field_type=MapType(key_id=147, key_type=IntegerType(), value_id=148, value_type=BinaryType()),
+            required=False,
+            doc="Map of column id to bloom filter",
+        ),


We cannot just add a field; this requires a spec change: https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals

@Fokko take a look now. I changed the spirit of the PR so that it:

doesn't modify the Iceberg specification

doesn't change any existing behavior

Rather, this PR just provides the initial utilities needed to read bloom filters from Parquet files at the file level.

If merged, next steps would be integrating them into the read path.

ForeverAngry · 2025-10-22T14:46:37Z

@Fokko I was kinda thinking that when I submitted it. But it was done, and I figured I'd just send it to see if it sparked any interest.

That's good information though, sometimes I forget about the governance structures that exist for these projects.

ForeverAngry · 2025-10-22T21:49:17Z

Thanks @ForeverAngry for working on this. Since this changes the specification, we have to go through an Iceberg improvement proposal to ensure that there is consensus across different implementations. As part of the change process, my main question would be: what's the added value on top of the bloom filters that are embedded in the Parquet files?

I guess, to me, the main benefit would be the ability to do file-level pruning before opening any files.

As a result, this would also come with some secondary benefits like being able to do row group-level pruning within a Parquet file after opening it.

Add utility functions to read and check bloom filters directly from Parquet files using PyArrow, without requiring Iceberg spec changes.

- get_parquet_bloom_filter_for_column(): Extract bloom filter from Parquet row group
- bloom_filter_might_contain(): Check if value might be in bloom filter

This provides a foundation for future bloom filter integration without modifying the Iceberg manifest specification.
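To make the "might contain" semantics concrete, here is a minimal sketch of a bloom filter and of file-level pruning built on top of it. It is not the code from this PR and it is not Parquet's split-block bloom filter format; `ToyBloomFilter` and `prune_files` are hypothetical names used only for illustration, and the sketch assumes a filter per data file is already available in memory.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative bloom filter. NOT Parquet's split-block format."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: bytes):
        # Derive the bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # keep the step odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def prune_files(filters: dict[str, ToyBloomFilter], value: bytes) -> list[str]:
    """Keep only the files whose filter cannot rule out `value`."""
    return [path for path, bf in filters.items() if bf.might_contain(value)]
```

A filter never returns a false negative, so a file that really contains the value always survives pruning. False positives are possible, so a surviving file still has to be scanned.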

Fokko · 2025-11-07T19:42:29Z

I guess, to me, the main benefit would be the ability to do file-level pruning before opening any files.

This is correct, and it would give some optimization. However, bloom filters need to be tuned based on the cardinality of the column. The price to pay is that we encode more information in the manifest files.
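To put rough numbers on the tuning point above, the sketch below applies the textbook sizing formulas for a classic bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n distinct values and target false-positive probability p. The function name `bloom_filter_size` is hypothetical, and Parquet uses a split-block variant, so its actual sizes will differ; treat this as an order-of-magnitude estimate only.

```python
import math


def bloom_filter_size(num_distinct: int, fpp: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a classic bloom filter.

    Textbook approximation only; not Parquet's split-block sizing.
    """
    if num_distinct <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("need num_distinct > 0 and 0 < fpp < 1")
    # m = -n * ln(p) / (ln 2)^2
    num_bits = math.ceil(-num_distinct * math.log(fpp) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    num_hashes = max(1, round(num_bits / num_distinct * math.log(2)))
    return num_bits, num_hashes


bits, hashes = bloom_filter_size(1_000_000, 0.01)
print(f"{bits / 8 / 1024 / 1024:.2f} MiB per column per file, {hashes} hashes")
```

At a 1% false-positive rate this works out to a little under 10 bits per distinct value, so a high-cardinality column costs on the order of a megabyte per million distinct values, which is the kind of overhead that would end up in the manifest files.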

As a result, this would also come with some secondary benefits like being able to do row group-level pruning within a Parquet file after opening it.

I would expect PyArrow to do this automatically 🤔

ForeverAngry · 2025-11-07T19:46:12Z

I guess, to me, the main benefit would be the ability to do file-level pruning before opening any files.

This is correct, and it would give some optimization. However, bloom filters need to be tuned based on the cardinality of the column. The price to pay is that we encode more information in the manifest files.

As a result, this would also come with some secondary benefits like being able to do row group-level pruning within a Parquet file after opening it.

I would expect PyArrow to do this automatically 🤔

Fair enough, if the "juice isn't worth the squeeze," as they say, then no big deal 😄

Fokko requested changes Oct 22, 2025


ForeverAngry force-pushed the feature/bloom-filter-read-support branch from ee5901f to d6a123b Compare November 7, 2025 17:54

ForeverAngry force-pushed the feature/bloom-filter-read-support branch from d6a123b to 55e8516 Compare November 7, 2025 18:00

ForeverAngry requested a review from Fokko November 7, 2025 18:17

ForeverAngry closed this Nov 7, 2025
